(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 1 017 019 A2

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication: 05.07.2000 Bulletin 2000/27
(21) Application number: 99204116.0
(22) Date of filing: 03.12.1999
(51) Int. Cl.7: G06T 7/00
(84) Designated Contracting States: AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
     Designated Extension States: AL LT LV MK RO SI
(30) Priority: 31.12.1998 US 223860
(71) Applicant: EASTMAN KODAK COMPANY, Rochester, New York 14650 (US)
(72) Inventors:
     • Luo, Jiebo, c/o Eastman Kodak Company, Rochester, New York 14650-2201 (US)
     • Etz, Stephen, c/o Eastman Kodak Company, Rochester, New York 14650-2201 (US)
     • Singhal, Amit, c/o Eastman Kodak Company, Rochester, New York 14650-2201 (US)
(74) Representative: Parent, Yves et al, KODAK INDUSTRIE, Département Brevets - CRT, Zone Industrielle, B.P. 21, 71102 Chalon-sur-Saône Cedex (FR)

(54) Method for automatic determination of main subjects in photographic images

(57) A method for detecting a main subject in an image comprises: receiving a digital image; extracting regions of arbitrary shape and size defined by actual objects from the digital image; grouping the regions into larger segments corresponding to physically coherent objects; extracting for each of the regions at least one structural saliency feature and at least one semantic saliency feature; and integrating saliency features using a probabilistic reasoning engine into an estimate of a belief that each region is the main subject.

Description

FIELD OF THE INVENTION



™ ■ ^on re*** gene,* to the «rj ot dgM MO* r*°ce«*> "* * ^ 



BACKGROUND OF THE INVENTION 
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,000* , photographic p^, a main JJ- 

£2. due to the lack of specific knowledge ^^^^Sarty observers if the photographer has 
the like. On the other hand, there is. m ^J^JCSJE In subject to the viewers. Therefore, rt . pos- 
successfully used the picture to commumcate h.s orher ™J"™L main subjects in images 
sible to desfcn a method to automatical* P erform importance for different regions that are 

l00 03] Main subject detection provides a ™^« a ^Zna£e of the scene contents for a number 

7S2S ^"StfS- S« versions of the im*ge. semantic informat.cn. and 

S Tne methods disclosed by the prior art can ^nTe^™ 

SL because such methods were d ^ ^^3^ «te^ory is considered "region-based" 
^Mostp^sed^ 

periorrSance of a visual system is strong* fdSKSKJ transform (DST) are computed to measure 

forms named the discrete moment transform (DMT) and ^"^er to exdude trivial symmetry cases, nonuni- 

s^^^er^^^^c^r^^^ 

; by the DMT operator. imaae- From biology to implementation, PhD thesis, University 

[00061 R. Milanese. Detoctf/V ***** ' e 9' ons '"f^Sld TvSaTatterttion. which combines knowledge about 
of Geneva. Switzerland. 1993. developed a computational ^^™^ ctured into three major stages. First, mu - 
human visual system with computer vision «^J^2fSS*2n. curvature, color contrast and the IK^ 
iple feature maps are extracted from the input image (for using a derivative of Gaussian model, which 

o Second, a conesponding number d "^^.^^S^xatS process is used to integrate the conspi- 
enhance regions of interest in each ,ea ^ re f m H a f^ inter-map and intra-map inconsistences. The 

est . ^ i r Rovack etal U S. Patent No. 5,724,456, developed a sys- 

« SioT] Todet^neanoptimaltonauepr^ 

em that partitions the image into blocks, combines certain o o is |abele d an active sector if the 

of a destination application. explicitly detect region of interest corresponding to 

S^*?SS2t eUon or gtf* about ft. »-» ^ „ lmases , ln Pra , 
[...] criterion (rejecting regions having a large number of pixels along the picture borders); (3) a face texture criterion (de-emphasizing regions whose texture does not correspond to skin samples); (4) a motion criterion (rejecting regions with no motion and low gradient or regions with very large motion and high gradient); and (5) a continuity criterion (temporal stability in motion). The main application of this method is directing resources in video coding, in particular for videophone or videoconference. Motion is clearly the most effective criterion for this technique, which is targeted at video rather than still images. Moreover, the fuzzy logic functions were designed in an ad hoc fashion. Lastly, this method requires a window predefined by a human operator, and is therefore not fully automatic.

[0010] W. Osberger, et al., "Automatic identification of perceptually important regions in an image," in Proc. IEEE Int. Conf. Pattern Recognition, 1998, evaluated several features known to influence human visual attention for each region of a segmented image to produce an importance value for each feature in each region. The features mentioned include low-level factors (contrast, size, shape, color, motion) and higher-level factors (location, foreground/background, people, context), but only contrast, size, shape, location and foreground/background (determining background by determining the proportion of total image border that is contained in each region) were implemented. Moreover, this method chose to treat each factor as being of equal importance by arguing that (1) there is little quantitative data which indicates the relative importance of these different factors and (2) the relative importance is likely to change from one image to another. Note that segmentation was obtained using the split-and-merge method based on 8 x 8 image blocks, and this segmentation method often results in over-segmentation and blotchiness around actual objects.
[0011] Q. Huang, et al., "Foreground/background segmentation of color images by integration of multiple cues," in Proc. IEEE Int. Conf. Image Process., 1995, addressed automatic segmentation of color images into foreground and background with the assumption that background regions are relatively smooth but may have gradually varying colors or be lightly textured. A multi-level segmentation scheme was devised that included color clustering, unsupervised segmentation based on the MDL (Minimum Description Length) principle, edge-based foreground/background separation, and integration of both region-based and edge-based segmentation. In particular, the MDL-based segmentation algorithm was used to further group the regions from the initial color clustering, and the four corners of the image were used to adaptively determine an estimate of the background gradient magnitude. The method was tested on around 100 well-composed images with a prominent main subject centered in the image against a large area of the assumed type of uncluttered background.

[0012] T. F. Syeda-Mahmood, "Data and model-driven selection using color regions," Int. J. Comput. Vision, vol. 21, no. 1, pp. 9-36, 1997, proposed a data-driven region selection method using color region segmentation and region-based saliency measurement. A collection of 220 primary color categories was pre-defined in the form of a color LUT (look-up table). Pixels are mapped to one of the color categories, grouped together through connected component analysis, and further merged according to compatible color categories. Two types of saliency measures, namely self-saliency and relative saliency, are linearly combined using heuristic weighting factors to determine the overall saliency. In particular, self-saliency included color saturation, brightness and size, while relative saliency included color contrast (defined by CIE distance) and size contrast between the concerned region and the surrounding region that is ranked highest among neighbors by size, extent and contrast, in successive order.

[0013] In summary, almost all of these reported methods have been developed for targeted types of images: video-conferencing or TV news broadcasting images, where the main subject is a talking person against a relatively simple static background (Osberger, Marichal); museum images, where there is a prominent main subject centered in the image against a large area of relatively clean background (Huang); and toy-world images, where the main subjects are a few distinctively colored and shaped objects (Milanese, Syeda). These methods were either not designed for unconstrained photographic images, or, even if designed with generic principles, were only demonstrated to be effective on rather simple images. The criteria and reasoning processes used were somewhat inadequate for less constrained images, such as photographic images.

SUMMARY OF THE INVENTION 

[0014] It is an object of this invention to provide a method for detecting the location of main subjects within a digitally captured image, thereby overcoming one or more of the problems set forth above.

[0015] It is also an object of this invention to provide a measure of belief for the location of main subjects within a digitally captured image, thereby capturing the intrinsic degree of uncertainty in determining the relative importance of different subjects in an image. The output of the algorithm is in the form of a list of segmented regions ranked in descending order of their likelihood as potential main subjects for a generic or specific application. Furthermore, this list can be converted into a map in which the brightness of a region is proportional to the main subject belief of the region.
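
The conversion from a ranked belief list to a belief map described in paragraph [0015] can be sketched as follows. This is a minimal illustration only: the function names, the use of a 2D list of region ids as the segmentation, and the 8-bit gray scale are assumptions made for the example and are not taken from the specification.

```python
def ranked_regions(beliefs):
    """Return region ids sorted by descending main-subject belief."""
    return sorted(beliefs, key=beliefs.get, reverse=True)


def belief_map(labels, beliefs, max_level=255):
    """Build a gray-level map whose brightness is proportional to belief.

    labels  : 2D list of region ids (hypothetical segmentation output)
    beliefs : dict mapping region id -> belief in [0, 1]
    """
    return [[int(round(max_level * beliefs[r])) for r in row] for row in labels]


labels = [[0, 0, 1],
          [2, 2, 1]]
beliefs = {0: 0.0, 1: 1.0, 2: 0.8}
print(ranked_regions(beliefs))
print(belief_map(labels, beliefs))
```

Keeping the ranked list and the map as two views of the same belief values means a downstream application can use whichever form suits it.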
[0016] It is also an object of this invention to use ground truth data. Ground truth, defined as human outlined main subjects, is used for feature selection and for training the reasoning engine.

[0017] It is also an object of this invention to provide a method of finding main subjects in an image in an automatic manner.
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mates of the scene characteristics. 

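[0020] The present invention comprises the steps of: receiving a digital image; extracting regions of arbitrary shape and size defined by actual objects from the digital image; grouping the regions into larger segments corresponding to physically coherent objects; extracting for each of the regions at least one structural saliency feature and at least one semantic saliency feature; and integrating saliency features using a probabilistic reasoning engine into an estimate of a belief that each region is the main subject.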

designate identical elements that are common to the figures. 

ADVANTAGEOUS EFFECT OF THE INVENTION

[0022] The present invention has the following advantages:

ESSSZ*-, rep— n of the ground trufr . which capture the inherent uncertainty in deterring 

r-^ss^ - re,ative importence of drf,erent fea * res 

30 l££ ground truth collection and 

extensive, robust feature extraction ^^SSESfti latter facilitated by explicrt identification of key fore- 
combination of structural saliency and semantic saiiency, me 
ground- and background- subject matters; ^dural saliency features; and, 

. hssz ^-srssr^nsss ~ — 
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BRIEF DESCRIPTION OF THE DRAWINGS 
[0023] 

Fig. 
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Sd Ffa 4(c) is an illustration of the PDF in the form of * W* 0 ^ £ * „ iLttfen of the PDF in the form 

S 5?s an illustration of the .ocation ^^^S^J^X^ *** * e ***"■ «' ^ 
of a 2D function. Fig. 5(b) is an .llustrat.cn of the PDF . «*• «™ ™ £ eight direction ; 

Fig. 8 is a block diagram of a preferred segmentation method.

DETAILED DESCRIPTION OF THE INVENTION
[...] software program. Those skilled in the art will readily recognize that the equivalent of such software may also be constructed in hardware.

[0025] Still further, as used herein, computer readable storage medium may comprise, for example: magnetic storage media such as a magnetic disk (such as a floppy disk) or magnetic tape; optical storage media such as an optical disc, optical tape or machine readable bar code; solid state electronic storage devices such as random access memory (RAM) or read only memory (ROM); or any other physical device or medium employed to store a computer program.

[0026] Referring to Fig. 1, there is illustrated a computer system 10 for implementing the present invention. Although the computer system 10 is shown for the purpose of illustrating a preferred embodiment, the present invention is not limited to the computer system 10 shown, but may be used on any electronic processing system. The computer system 10 includes a microprocessor based unit 20 for receiving and processing software programs and for performing other processing functions. A touch screen display 30 is electrically connected to the microprocessor based unit 20 for displaying user related information associated with the software, and for receiving user input via touching the screen. A keyboard 40 is also connected to the microprocessor based unit 20 for permitting a user to input information to the software. As an alternative to using the keyboard 40 for input, a mouse 50 may be used for moving a selector 52 on the display 30 and for selecting an item on which the selector 52 overlays, as is well known in the art.

[0027] A compact disk-read only memory (CD-ROM) 55 is connected to the microprocessor based unit 20 for receiving software programs and for providing a means of inputting the software programs and other information to the microprocessor based unit 20 via a compact disk 57, which typically includes a software program. In addition, a floppy disk 61 may also include a software program, and is inserted into the microprocessor based unit 20 for inputting the software program. Still further, the microprocessor based unit 20 may be programmed, as is well known in the art, for storing the software program internally. A printer 56 is connected to the microprocessor based unit 20 for printing a hardcopy of the output of the computer system 10.

[0028] Images may also be displayed on the display 30 via a personal computer card (PC card) 62 or, as it was formerly known, a personal computer memory card international association card (PCMCIA card), which contains digitized images electronically embodied in the card 62. The PC card 62 is ultimately inserted into the microprocessor based unit 20 for permitting visual display of the image on the display 30.

[0029] Referring to Fig. 2, there is shown a block diagram of an overview of the present invention. First, an input image of a natural scene is acquired and stored S0 in a digital form. Then, the image is segmented S2 into a few regions of homogeneous properties. Next, the region segments are grouped into larger regions based on similarity measures S4 through non-purposive perceptual grouping, and further grouped into larger regions corresponding to perceptually coherent objects S6 through purposive grouping (purposive grouping concerns specific objects). The regions are evaluated for their saliency S8 using two independent yet complementary types of saliency features - structural saliency features and semantic saliency features. The structural saliency features, including a set of low-level early vision features and a set of geometric features, are extracted S8a, and are further processed to generate a set of self-saliency features and a set of relative saliency features. Semantic saliency features, in the form of key subject matters which are likely to be part of either foreground (for example, people) or background (for example, sky, grass), are detected S8b to provide semantic cues as well as scene context cues. The evidences of both types are integrated S10 using a reasoning engine based on a Bayes net to yield the final belief map of the main subject.

[0030] For the purpose of semantic interpretation of images, a single criterion is clearly insufficient. The human brain, furnished with its a priori knowledge and enormous memory of real world subjects and scenarios, combines different subjective criteria in order to give an assessment of the interesting or primary subject(s) in a scene. The following extensive list of features is believed to influence the human brain in performing such a somewhat intangible task as main subject detection: location, size, brightness, colorfulness, texturefulness, key subject matter, shape, symmetry, spatial relationship (surroundedness/occlusion), borderness, indoor/outdoor, orientation, depth (when applicable), and motion (when applicable for video sequence).

[0031] In the present invention, the low-level early vision features include color, brightness, and texture. The geometric features include location (centrality), spatial relationship (borderness, adjacency, surroundedness, and occlusion), size, shape, and symmetry. The semantic features include flesh, face, sky, grass, and other green vegetation. Those skilled in the art can define more features without departing from the scope of the present invention.
S2: Region segmentation

[0032] The adaptive Bayesian color segmentation algorithm (Luo et al., "Towards physics-based segmentation of photographic color images," Proceedings of the IEEE International Conference on Image Processing, 1997) is used to generate a tractable number of physically coherent regions of arbitrary shape. Although this segmentation method is preferred, it will be appreciated that a person of ordinary skill in the art can use a different segmentation method to obtain object regions of arbitrary shape without departing from the scope of the present invention. Segmentation of arbitrarily shaped regions provides the advantages of: (1) accurate measure of the size, shape, location of and spatial relationship [...]
[...] segmentation of the image into regions is obtained. [...] partitioned into a plurality of clusters that correspond to [...] colors in the image. Each pixel of the image is classified to the closest cluster in the color space according to a [...] physics-based color distance metric with respect to the mean values of the color clusters (Luo et al., "Towards physics-based segmentation of photographic color images," Proceedings of the IEEE International Conference on Image Processing, 1997). This classification process results in an initial segmentation of the image. A neighborhood window is placed at each pixel in order to determine what neighborhood pixels are used to compute the local color histogram for this pixel. The window size is initially set at the size of the entire image S52, so that the local color histogram is the same as the one for the entire image and does not need to be recomputed. Next, an iterative procedure is performed between two alternating processes: recomputing S54 the local mean values of each color class based on the current segmentation, and reclassifying the pixels according to the updated local mean values of color classes S56. This iterative procedure is performed until a convergence is reached S60. During this iterative procedure, the strength of the spatial constraints can be adjusted in a gradual manner S58 (for example, the value of β, which indicates the strength of the spatial constraints, is increased linearly with each iteration). After the convergence is reached for a particular window size, the window used to estimate the local mean values for color classes is reduced by half in size S62. The iterative procedure is repeated for the reduced window size to allow more accurate estimation of the local mean values for color classes. This mechanism introduces spatial adaptivity into the segmentation process. Finally, segmentation of the image is obtained when the iterative procedure reaches convergence for the minimum window size S64.

S4 & S6: Perceptual grouping

disss? ffi-sssssssraas ^ ^ - — - 

o example, a person has head, torso and 1,n * >s )- hiah-level vision features. Without proper perceptual group- 

[0035] Perceptual grouping fac.l.tates the reason c"**** properties as size and shape. Perceptual 

regions, and model of specific object (purposive grouping). 



S8: Feature extraction
[0036] For each region, an extensive set of features, which are shown to contribute to visual attention, are extracted and associated evidences are then computed. The list of features consists of three categories - low-level vision features, geometric features, and semantic features. [...] a self-saliency feature and a relative saliency feature are computed. The self-saliency is used to capture subjects that stand out by themselves (for example, in color, texture, location and the like), while the relative saliency is used to capture subjects that are in high contrast to their surrounding (for example, shape). Furthermore, raw measurements of features, self salient or relatively salient, are converted into evidences, whose values are normalized to be within [0, 1.0], by belief sensor functions with appropriate nonlinearity characteristics. Referring to Fig. 3, there is shown [...] belief sensor function [...] some features, as will be described hereinbelow.

Structural saliency features

imm M safcnc, Mi indud. » * — » - — «• «— ^ 
[...] appropriate factor (for example, 2). All regions intersecting this stretched MBR 12, which is indicated by the dotted lines, are considered neighbors of the region. This extended neighborhood ensures adequate context as well as natural scalability for computing the relative saliency features.
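
A minimal sketch of this extended neighborhood follows. Two simplifications are assumed for illustration and are not stated in the specification: the minimum bounding rectangle (MBR) is stretched about its own center, and each candidate neighbor is represented by its own MBR rather than by its actual pixels. All names are hypothetical.

```python
def stretch_mbr(mbr, factor=2.0):
    """Stretch a box (x_min, y_min, x_max, y_max) about its center."""
    x0, y0, x1, y1 = mbr
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = factor * (x1 - x0) / 2.0
    half_h = factor * (y1 - y0) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def boxes_intersect(a, b):
    """True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def neighbors(region_id, mbrs, factor=2.0):
    """Regions whose box intersects the stretched MBR of region_id."""
    stretched = stretch_mbr(mbrs[region_id], factor)
    return [r for r, box in mbrs.items()
            if r != region_id and boxes_intersect(stretched, box)]


mbrs = {"a": (4, 4, 6, 6), "b": (7, 4, 8, 6), "c": (20, 20, 22, 22)}
print(neighbors("a", mbrs))
```

Because the neighborhood scales with the region's own extent, large regions gather context from farther away than small ones, which is the scalability property the paragraph refers to.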
[0039] The following structural saliency features are computed: 

• contrast in hue (a relative saliency feature)

[0040] In terms of color, the contrast in hue between an object and its surrounding is a good indication of the saliency in color.

$$\mathrm{contrast}_{\mathrm{color}} = \sum_{\mathrm{neighborhood}} \frac{\left|\mathrm{hue} - \mathrm{hue}_{\mathrm{surrounding}}\right|}{\mathrm{hue}_{\mathrm{surrounding}}} \qquad (1)$$

where the neighborhood refers to the context previously defined and henceforth.

• colorfulness (a self-saliency feature) and contrast in colorfulness (a relative saliency feature)

[0041] In terms of colorfulness, the contrast between a colorful object and a dull surrounding is almost as good an indicator as the contrast between a dull object and a colorful surrounding. Therefore, the contrast in colorfulness should always be positive. In general, it is advantageous to treat a self-saliency and the corresponding relative saliency as separate features rather than combining them using certain heuristics. The influence of each feature will be determined separately by the training process, which will be described later.

$$\mathrm{colorfulness} = \mathrm{saturation} \qquad (2)$$

$$\mathrm{contrast}_{\mathrm{colorfulness}} = \frac{\left|\mathrm{saturation} - \mathrm{saturation}_{\mathrm{surrounding}}\right|}{\mathrm{saturation}_{\mathrm{surrounding}}} \qquad (3)$$
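
The relative contrast of equation (3) can be sketched as a small helper. The `eps` guard against a zero denominator and the function name are assumptions added for the example; the specification does not say how a zero surrounding value is handled.

```python
def relative_contrast(value, surrounding, eps=1e-6):
    """|value - surrounding| / surrounding, always non-negative.

    eps is a hypothetical guard against division by zero.
    """
    return abs(value - surrounding) / max(surrounding, eps)


# A colorful object on a dull surround and a dull object on a colorful
# surround both yield a positive contrast, as the text requires.
print(relative_contrast(0.8, 0.2))
print(relative_contrast(0.2, 0.8))
```

Note that the measure is not symmetric: dividing by the surrounding value makes a colorful object on a dull background score higher than the reverse case.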

• brightness (a self-saliency feature) and contrast in brightness (a relative saliency feature)

[0042] In terms of brightness, the contrast between a bright object and a dark surrounding is almost as good as the contrast between a dark object and a bright surrounding. In particular, the main subject tends to be lit up in flash scenes.

$$\mathrm{brightness} = \mathrm{luminance} \qquad (4)$$

$$\mathrm{contrast}_{\mathrm{brightness}} = \frac{\left|\mathrm{brightness} - \mathrm{brightness}_{\mathrm{surrounding}}\right|}{\mathrm{brightness}_{\mathrm{surrounding}}} \qquad (5)$$

• texturefulness (a self-saliency feature) and contrast in texturefulness (a relative saliency feature)

[0043] In terms of texturefulness, in general, a large uniform region with very little texture tends to be the background. On the other hand, the contrast between a highly textured object and a non-textured or less textured surrounding is a good indication of main subjects. The same holds for a non-textured or less textured object and a highly textured surrounding.

$$\mathrm{texturefulness} = \mathrm{texture\_energy} \qquad (6)$$

$$\mathrm{contrast}_{\mathrm{texturefulness}} = \frac{\left|\mathrm{texturefulness} - \mathrm{texturefulness}_{\mathrm{surrounding}}\right|}{\mathrm{texturefulness}_{\mathrm{surrounding}}} \qquad (7)$$

• location (a self-saliency feature) 

[0044] In terms of location, the main subject tends to be located near the center instead of the periphery of the
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cajof its size and shape. A senW«, measu™ s £~d LtW in «*» the .main subject 

a varying degree depending on its location. 
$$\mathrm{centrality} = \frac{1}{N_R} \sum_{(x,y) \in R} \mathrm{PDF}_{\mathrm{MSD\_location}}(x, y) \qquad (8)$$

where (x, y) denotes a pixel in the region R, N_R is the number of pixels in region R, and PDF_MSD_location denotes a 2D probability density function (PDF) of main subject location. If the orientation is unknown, the PDF is symmetric about the center of the image in both vertical and horizontal directions, which results in an orientation-independent centrality measure. An orientation-unaware PDF is shown in Fig. 4(a) and the projections in the width and height directions are also shown in Fig. 4(b) and Fig. 4(c), respectively. If the orientation is known, the PDF is symmetric about the center of the image in the horizontal direction but not in the vertical direction, which results in an orientation-aware centrality measure. An orientation-aware PDF is shown in Fig. 5(a) and the projections in the horizontal and vertical directions are also shown in Fig. 5(b) and Fig. 5(c), respectively.
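
The centrality measure, the average of a location PDF over the pixels of a region, can be sketched as below. The PDFs of Fig. 4 and Fig. 5 are not available in this text, so a separable Gaussian centered on the image is substituted here purely as a stand-in for the orientation-unaware case; `sigma_frac` and the function names are hypothetical.

```python
import math


def gaussian_location_pdf(width, height, sigma_frac=0.25):
    """Stand-in location PDF: a centered 2D Gaussian, normalized to sum to 1."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sx, sy = sigma_frac * width, sigma_frac * height
    pdf = [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)]
           for y in range(height)]
    total = sum(sum(row) for row in pdf)
    return [[v / total for v in row] for row in pdf]


def centrality(region_pixels, pdf):
    """Mean PDF value over the (x, y) pixels of a region."""
    return sum(pdf[y][x] for x, y in region_pixels) / len(region_pixels)


pdf = gaussian_location_pdf(9, 9)
print(centrality([(4, 4)], pdf) > centrality([(0, 0)], pdf))
```

Averaging over all pixels, rather than evaluating the PDF at the centroid alone, lets the size and shape of the region influence the result.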



• size (a self-saliency feature)

[0046] Main subjects should have considerable but reasonable sizes [...] discounted.


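$$\mathrm{size} = \begin{cases} 0 & \text{if } s > s_4 \\ 1 - \dfrac{s - s_3}{s_4 - s_3} & \text{if } s > s_3 \text{ and } s < s_4 \\ 1 & \text{if } s > s_2 \text{ and } s < s_3 \\ \dfrac{s - s_1}{s_2 - s_1} & \text{if } s > s_1 \text{ and } s < s_2 \\ 0 & \text{if } s < s_1 \end{cases} \qquad (9)$$

[0047] [...] scaling.

$$\mathrm{size} = \frac{\mathrm{region\_pixels}}{\mathrm{image\_pixels}} \qquad (10)$$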
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[0M81 „ „ s «*a « — - is d-» » 0~ o, ,h,.e »n, ie^ed W, and W 

using two thresholds s2 and s3, where s2 < s3. 
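
The trapezoidal size measure of equation (9) can be sketched as follows, taking s to be the fraction of image pixels covered by the region as in equation (10). The threshold values in the usage lines are invented for the example, and the handling of values exactly equal to a threshold is an assumption, since the equation uses strict inequalities throughout.

```python
def size_fraction(region_pixels, image_pixels):
    """Region size as a fraction of the image, invariant to scaling."""
    return region_pixels / image_pixels


def size_belief(s, s1, s2, s3, s4):
    """Trapezoid: 0 below s1, ramp up to 1 at s2, flat to s3, ramp down to 0 at s4."""
    if s <= s1 or s >= s4:
        return 0.0
    if s < s2:
        return (s - s1) / (s2 - s1)
    if s <= s3:
        return 1.0
    return 1.0 - (s - s3) / (s4 - s3)


# Hypothetical thresholds, chosen only to exercise each branch.
s1, s2, s3, s4 = 0.01, 0.1, 0.5, 0.9
for s in (0.005, 0.055, 0.3, 0.7, 0.95):
    print(s, size_belief(s, s1, s2, s3, s4))
```

The two ramps discount very small and very large regions gradually instead of rejecting them at a hard cutoff.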



• shape (a self-saliency feature) and contrast in shape (a relative saliency feature)
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contrast in shape indicates conspicuity (for example, a child among a pool of bubble balls). 

[0050] The shape features are divided into two categories, self salient and relatively salient. Self salient features characterize the shape properties of the regions themselves and relatively salient features characterize the shape properties of the regions in comparison to those of neighboring regions.

[0051] The aspect ratio of a region is the ratio of the major axis to the minor axis of the region. A Gaussian belief function maps the aspect ratio to a belief value. This feature detector is used to discount long narrow shapes from being part of the main subject.

[0052] Three different measures are used to characterize the convexity of a region: (1) perimeter-based - perimeter of the convex hull divided by the perimeter of the region; (2) area-based - area of the region divided by the area of the convex hull; and (3) hyperconvexity - the ratio of the perimeter-based convexity to the area-based convexity. In general, an object of complicated shape has a hyperconvexity greater than 1.0. The three convexity features measure the compactness of the region. Sigmoid belief functions are used to map the convexity measures to beliefs.

[0053] The rectangularity is the area of the MBR of a region divided by the area of the region. A sigmoid belief function maps the rectangularity to a belief value. The circularity is the square of the perimeter of the region divided by the area of the region. A sigmoid belief function maps the circularity to a belief value.
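
The convexity, rectangularity and circularity measures of paragraphs [0052] and [0053] can be sketched for a region given as a simple polygon. Representing the region by its boundary polygon, using an axis-aligned bounding box as the MBR, and the function names are all assumptions for illustration; the convex hull uses Andrew's monotone chain algorithm, which the specification does not prescribe.

```python
import math


def polygon_area(poly):
    """Shoelace formula for a simple polygon given as (x, y) vertices."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return abs(sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in pairs)) / 2.0


def polygon_perimeter(poly):
    return sum(math.dist(p, q) for p, q in zip(poly, poly[1:] + poly[:1]))


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    return half_hull(pts) + half_hull(reversed(pts))


def shape_features(poly):
    hull = convex_hull(poly)
    area, perim = polygon_area(poly), polygon_perimeter(poly)
    perimeter_convexity = polygon_perimeter(hull) / perim
    area_convexity = area / polygon_area(hull)
    xs, ys = [x for x, _ in poly], [y for _, y in poly]
    mbr_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return {
        "perimeter_convexity": perimeter_convexity,
        "area_convexity": area_convexity,
        "hyperconvexity": perimeter_convexity / area_convexity,
        "rectangularity": mbr_area / area,
        "circularity": perim ** 2 / area,
    }


l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(shape_features(l_shape))
```

For a convex shape such as a square the hyperconvexity is exactly 1.0, while the concave L-shape above yields a value greater than 1.0, in line with the remark about complicated shapes.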

[0054] Relative shape-saliency features include relative rectangularity, relative circularity and relative convexity. In particular, each of these relative shape features is defined as the average difference between the corresponding self salient shape feature of the region and those of the neighborhood regions, respectively. Finally, a Gaussian function is used to map the relative measures to beliefs.



• symmetry (a self-saliency feature)

[0055] Objects of striking symmetry, natural or artificial, are also likely to be of great interest. Local symmetry can be computed using the method described by V. D. Gesu, et al., "Local operators to detect regions of interest," Pattern Recognition Letters, vol. 18, pp. 1077-1081, 1997.

• spatial relationship (a relative saliency feature)

[0056] In general, main subjects tend to be in the foreground. Consequently, main subjects tend to share boundaries with a lot of background regions (background clutter), or be enclosed by large background regions such as sky, grass, snow, wall and water, or occlude other regions. These characteristics in terms of spatial relationship may reveal the region of attention. Adjacency, surroundedness and occlusion are the main features in terms of spatial relationship. In many cases, occlusion can be inferred from T-junctions (L. R. Williams, "Perceptual organization of occluding contours," in Proc. IEEE Int. Conf. Computer Vision, 1990) and fragments can be grouped based on the principle of perceptual occlusion (J. August, et al., "Fragment grouping via the principle of perceptual occlusion," in Proc. IEEE Int. Conf. Pattern Recognition, 1996).

[0057] In particular, a region that is nearly completely surrounded by a single other region is more likely to be the main subject. Surroundedness is measured as the maximum fraction of the region's perimeter that is shared with any one neighboring region. A region that is totally surrounded by a single other region has the highest possible surroundedness value of 1.0.

$$\mathrm{surroundedness} = \max_{\mathrm{neighbors}} \frac{\mathrm{length\_of\_common\_border}}{\mathrm{region\_perimeter}} \qquad (11)$$
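
Surroundedness can be sketched on a label map as follows. Measuring border length as the number of 4-connected pixel edges, and counting edges on the image boundary toward the region perimeter, are assumptions made for this example; the specification does not say how lengths are measured.

```python
from collections import Counter


def surroundedness(labels, region):
    """Max fraction of the region's perimeter shared with any single neighbor.

    labels is a 2D list of region ids; lengths are counted in pixel edges.
    """
    h, w = len(labels), len(labels[0])
    shared = Counter()   # edges shared with each neighboring region
    perimeter = 0        # all edges separating the region from anything else
    for y in range(h):
        for x in range(w):
            if labels[y][x] != region:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    other = labels[ny][nx]
                    if other != region:
                        shared[other] += 1
                        perimeter += 1
                else:
                    perimeter += 1   # edge on the image boundary
    if perimeter == 0 or not shared:
        return 0.0
    return max(shared.values()) / perimeter


labels = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
print(surroundedness(labels, 1))
print(surroundedness(labels, 0))
```

In the example the center region is fully enclosed by region 0 and scores 1.0, while region 0 shares only 4 of its 16 boundary edges with the center and scores 0.25.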



• borderness (a self-saliency feature)

[0058] Many background regions tend to contact one or more of the image borders. In other words, a region that has a significant amount of its contour on the image borders tends to belong to the background. The percentage of the contour points on the image borders and the number of image borders shared (at most four) can be good indications of the background.

[0059] In the case where the orientation is unknown, one borderness feature places each region in one of six categories determined by the number and configuration of image borders the region is "in contact" with. A region is "in contact" with a border when at least one pixel in the region falls within a fixed distance of the border of the image. Distance [...]
Categories for orientation-independent borderness_a.

Category    The region is in contact with...

none of the image borders 
exactly one of the image borders 
exactly two of the image borders, adjacent to one another 
exactly two of the image borders, opposite to one another 
exactly three of the image borders 
exactly four (all) of the image borders 






^ssssssssxxsss =v- - ~ 

tion. 



Table 2

Categories for orientation-dependent borderness_a.

The region is in contact with...

Top   Bottom   Left   Right   Category
N     N        N      N       0
N     Y        N      N       1
Y     N        N      N       2
N     N        Y      N       3
N     N        N      Y       3
N     Y        Y      N       4
N     Y        N      Y       4
Y     N        Y      N       5
Y     N        N      Y       5
Y     Y        N      N       6
N     N        Y      Y       7
N     Y        Y      Y       8
Y     Y        Y      N       9
Y     Y        N      Y       9
Y     N        Y      Y       10
Y     Y        Y      Y       11
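
The orientation-dependent categories of Table 2 amount to a lookup keyed on which of the four image borders a region contacts. The sketch below encodes the table as a dictionary; the two category 5 rows are taken to be top plus left and top plus right, since those are the only two of the sixteen combinations not otherwise listed. The function and constant names are hypothetical.

```python
# Keys are (top, bottom, left, right) contact flags; values are Table 2 categories.
ORIENTED_BORDERNESS_CATEGORY = {
    (False, False, False, False): 0,
    (False, True,  False, False): 1,
    (True,  False, False, False): 2,
    (False, False, True,  False): 3,
    (False, False, False, True):  3,
    (False, True,  True,  False): 4,
    (False, True,  False, True):  4,
    (True,  False, True,  False): 5,
    (True,  False, False, True):  5,
    (True,  True,  False, False): 6,
    (False, False, True,  True):  7,
    (False, True,  True,  True):  8,
    (True,  True,  True,  False): 9,
    (True,  True,  False, True):  9,
    (True,  False, True,  True):  10,
    (True,  True,  True,  True):  11,
}


def borderness_a_oriented(top, bottom, left, right):
    """Category of a region given which image borders it is in contact with."""
    return ORIENTED_BORDERNESS_CATEGORY[(bool(top), bool(bottom), bool(left), bool(right))]


print(borderness_a_oriented(False, True, False, False))   # bottom border only
print(borderness_a_oriented(True, True, True, True))      # all four borders
```

Left and right contacts share a category in every row, which reflects the left-right symmetry expected of an upright image, whereas top and bottom are kept distinct.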

[0061] Regions that include a large fraction of the image border are also likely to be background regions. This feature indicates what fraction of the image border is in contact with the given region.

$$\mathrm{borderness\_b} = \frac{\mathrm{perimeter\_pixels\_in\_this\_region}}{2 \, (\mathrm{image\_height} + \mathrm{image\_width} - 2)} \qquad (12)$$

[0062] When a large fraction of the region perimeter is on the image border, a region is also likely to be background. Such a ratio is unlikely to exceed 0.5, so a value in the range [0, 1] is obtained by scaling the ratio by a factor of 2 and saturating the ratio at the value of 1.0.

$$\mathrm{borderness\_c} = \min\left(1, \; \frac{2 \cdot \mathrm{num\_region\_perimeter\_pixels\_on\_border}}{\mathrm{region\_perimeter}}\right) \qquad (13)$$

[0063] Again, note that instead of a composite borderness measure based on heuristics, all the above three borderness measures are separately trained and used in the main subject detection.
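
Equations (12) and (13) can be sketched on a label map as shown below. The region perimeter is taken here to be the count of region pixels that lie on the image border or have a 4-connected neighbor outside the region; that pixel-counting convention and the function name are assumptions, as the specification does not define how the perimeter is measured.

```python
def borderness_b_c(labels, region):
    """Return (borderness_b, borderness_c) for one region of a 2D label map."""
    h, w = len(labels), len(labels[0])
    on_border = 0   # region pixels lying on the image border
    perimeter = 0   # region pixels on the image border or next to another region
    for y in range(h):
        for x in range(w):
            if labels[y][x] != region:
                continue
            at_edge = y in (0, h - 1) or x in (0, w - 1)
            # Neighbors are only indexed when the pixel is not at the image edge.
            is_perimeter = at_edge or any(
                labels[y + dy][x + dx] != region
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)))
            on_border += at_edge
            perimeter += is_perimeter
    b = on_border / (2 * (h + w - 2))
    c = min(1.0, 2 * on_border / perimeter) if perimeter else 0.0
    return b, c


# 6 x 6 image with a 3-row by 2-column region touching the top border.
labels = [[0] * 6 for _ in range(6)]
for y in range(3):
    for x in (2, 3):
        labels[y][x] = 1
print(borderness_b_c(labels, 1))
```

The denominator of borderness_b is the number of pixels on the image border, so the measure reads directly as the share of the border occupied by the region.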

Semantic saliency features 

• flesh/face/people (foreground, self saliency features) 

[0064] A majority of photographic images have people and about the same number of images have sizable faces
in them. In conjunction with certain shape analysis and pattern analysis, some detected flesh regions can be identified
as faces. Subsequently, using models of human figures, flesh detection and face detection can lead to clothing
detection and eventually people detection.

[0065] The current flesh detection algorithm utilizes color image segmentation and a pre-determined flesh
distribution in a chrominance space (Lee, "Color image quantization based on physics and psychophysics," Journal of Society
of Photographic Science and Technology of Japan, Vol. 59, No. 1, pp. 212-225, 1996). The flesh region classification is
based on Maximum Likelihood Estimation (MLE) according to the average color of a segmented region. The conditional
probabilities are mapped to a belief value via a sigmoid belief function.
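
A sigmoid belief function of the kind mentioned above can be sketched as follows. The text does not give the parameters of the function, so the `midpoint` and `slope` arguments and their default values are purely illustrative assumptions, as is the function name.

```python
import math

def sigmoid_belief(probability, midpoint=0.5, slope=10.0):
    """Map a conditional probability to a belief in (0, 1).

    midpoint: input value that maps to a belief of 0.5 (assumed).
    slope:    steepness of the transition around the midpoint (assumed).
    """
    return 1.0 / (1.0 + math.exp(-slope * (probability - midpoint)))
```

The mapping is monotonic, so a region whose average color is more likely to be flesh always receives a higher belief; the slope controls how sharply the belief changes near the midpoint.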

[0066] A primitive face detection algorithm is used in the present invention. It combines the flesh map output by the
flesh detection algorithm with other face heuristics to output a belief in the location of faces in an image. Each region in
an image that is identified as a flesh region is fitted with an ellipse. The major and minor axes of the ellipse are
calculated, as are the number of pixels in the region outside the ellipse and the number of pixels in the ellipse not part of the
region. The aspect ratio is computed as a ratio of the major axis to the minor axis. The belief for the face is a function
of the aspect ratio of the fitted ellipse, the area of the region outside the ellipse, and the area of the ellipse not part of
the region. A Gaussian belief sensor function is used to scale the raw function outputs to beliefs.
[0067] It will be appreciated that a person of ordinary skill in the art can use a different face detection method
without departing from the present invention.
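
The ellipse-based measurements described for the face heuristic can be illustrated with a moment-based fit. This is only a sketch under assumptions: the text does not say how the ellipse is fitted, so second-order moments are used here as one common choice, and the function names and the `mean` and `sigma` parameters of the Gaussian belief function are hypothetical. The mask is assumed to contain a non-degenerate region (more than one row and column).

```python
import numpy as np

def ellipse_fit_measurements(mask):
    """Fit an ellipse to a boolean region mask using second-order moments."""
    ys, xs = np.nonzero(mask)
    center = np.array([xs.mean(), ys.mean()])
    # For a filled ellipse, the variance along an axis is (semi-axis)^2 / 4.
    eigvals, eigvecs = np.linalg.eigh(np.cov(np.vstack([xs, ys])))
    minor, major = 2.0 * np.sqrt(eigvals)          # eigh sorts ascending
    yy, xx = np.indices(mask.shape)
    # Project pixel offsets onto the ellipse axes (columns of eigvecs).
    offsets = np.stack([xx - center[0], yy - center[1]], axis=-1) @ eigvecs
    in_ellipse = ((offsets[..., 0] / minor) ** 2 +
                  (offsets[..., 1] / major) ** 2) <= 1.0
    return {
        "aspect_ratio": major / minor,
        "region_outside_ellipse": int(np.count_nonzero(mask & ~in_ellipse)),
        "ellipse_outside_region": int(np.count_nonzero(in_ellipse & ~mask)),
    }

def gaussian_belief(value, mean, sigma):
    """Scale a raw measurement to a belief in (0, 1]; parameters are assumed."""
    return float(np.exp(-0.5 * ((value - mean) / sigma) ** 2))
```

A compact, roughly elliptical flesh region yields small mismatch counts in both directions, whereas an irregular region (for example, an arm) leaves many pixels outside the fitted ellipse.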

• key background subject matters (self saliency features) 

[0068] There are a number of objects that frequently appear in photographic images, such as sky, cloud, grass,
tree, foliage, vegetation, water body (river, lake, pond), wood, metal, and the like. Most of them are highly likely to
be background objects. Therefore, such objects can be ruled out while they also serve as precursors for main subjects
as well as scene types.

[0069] Among these background subject matters, sky and grass (may include other green vegetation) are detected
with relatively high confidence due to the amount of constancy in terms of their color, texture, spatial extent, and spatial
location.

Probabilistic Reasoning 

[0070] All the saliency features are integrated by a Bayes net to yield the likelihood of main subjects. On one hand,
different pieces of evidence may compete with or contradict each other. On the other hand, different pieces of evidence may
mutually reinforce each other according to prior models or knowledge of typical photographic scenes. Both competition and
reinforcement are resolved by the Bayes net-based inference engine.

[0071] A Bayes net (J. Pearl, Probabilistic Reasoning in Intelligent Systems, San Francisco, CA: Morgan
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orach The direction of links represents causality. It s an char acterization, fast and efficient com- 

A Bayes net consists of four components:

[0072] Referring to Fig. , a two-level Bayesian net is used that assumes conditional independence between various
feature detectors. The feature detectors are at the leaf nodes 22. There is one Bayes net active for each region (identified
by the segmentation algorithm) in the image. The root node gives the posterior belief of that region being part of the main
subject. It is to be understood that the present invention can be used with a Bayes net that has more than two levels without
departing from the scope of the present invention.
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• \ n the number of observations (observers) 

where I is the set of all training images, R_i is the set of all regions in image i, n_i ... L-level ground-truth vector,

i «U suSac, »»n ao. a ™,n s*iact 
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The trained CPM. 




                     Feature = 1    Feature = 0
Main subject = 1     0.35 (11)      0.65 (01)
Main subject = 0     0.17 (10)      0.83 (00)


[0077] The output of the algorithm is in the form of a list of segmented regions ranked in a descending order of their
likelihood as potential main subjects for a generic or specific application. Furthermore, this list can be converted into a
map in which the brightness of a region is proportional to the main subject belief of the region. This "belief map" is more
than a binary map that only indicates location of the determined main subject. The associated likelihood is also
attached to each region so that the regions with large brightness values correspond to regions with high confidence or
belief being part of the main subject. This reflects the inherent uncertainty for humans to perform such a task. However,
a binary decision, when desired, can be readily obtained by applying an appropriate threshold to the belief map.
Moreover, the belief information may be very useful for downstream applications. For example, different weighting factors can
be assigned to different regions in determining bit allocation for image coding.
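
The ranked list, the belief map, and the optional binary decision described above can be sketched as follows. The function names, the label-image representation of the segmentation, and the threshold of 0.5 are assumptions made for illustration.

```python
import numpy as np

def rank_regions(beliefs):
    """Region ids sorted by descending main subject belief."""
    return sorted(beliefs, key=beliefs.get, reverse=True)

def belief_map(labels, beliefs):
    """Image in which each pixel carries the belief of its region."""
    out = np.zeros(labels.shape, dtype=float)
    for region_id, belief in beliefs.items():
        out[labels == region_id] = belief
    return out

# Hypothetical segmentation with three regions and their beliefs.
labels = np.array([[0, 0, 1],
                   [2, 2, 1]])
beliefs = {0: 0.1, 1: 0.8, 2: 0.4}

ranking = rank_regions(beliefs)
bmap = belief_map(labels, beliefs)
binary_map = bmap >= 0.5          # assumed threshold for a binary decision
```

Keeping the continuous map and thresholding only on demand preserves the belief values for downstream uses such as region-dependent bit allocation.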
[0078] Other aspects of the invention include: 

1. The method wherein the step of extracting for each of the regions at least one structural saliency feature and at 
least one semantic saliency feature includes using an extended neighborhood window to compute a plurality of the 
relative saliency features, wherein the extended neighborhood window is determined by the steps of: 

(c1) finding a minimum bounding rectangle of a region;

(c2) stretching the minimum bounding rectangle in all four directions proportionally; and 

(c3) defining all regions intersecting the stretched minimum bounding rectangle as neighbors of the region. 

2. The method as in claim 4, wherein the step of extracting for each of the regions at least one structural saliency
feature and at least one semantic saliency feature includes using a centrality as the location feature, wherein the
centrality feature is computed by the steps of:

(c1) determining a probability density function of main subject locations using a collection of training data;

(c2) computing an integral of the probability density function over an area of a region; and, 

(c3) obtaining a value of the centrality feature by normalizing the integral by the area of the region. 
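
The two procedures listed above (the extended neighborhood window and the centrality feature) can be sketched in a few lines. This is illustrative only: the function names are hypothetical, the stretch factor of 0.5 is an assumed value, and the Gaussian location density is a stand-in for a probability density function that would in practice be determined from training data.

```python
import numpy as np

def extended_neighbors(labels, region_id, stretch=0.5):
    """Ids of regions intersecting the stretched minimum bounding rectangle."""
    rows, cols = np.nonzero(labels == region_id)
    top, bottom = rows.min(), rows.max()
    left, right = cols.min(), cols.max()
    # Stretch proportionally to the rectangle's own height and width.
    dy = int(round(stretch * (bottom - top + 1)))
    dx = int(round(stretch * (right - left + 1)))
    height, width = labels.shape
    window = labels[max(0, top - dy):min(height, bottom + dy + 1),
                    max(0, left - dx):min(width, right + dx + 1)]
    return set(np.unique(window).tolist()) - {region_id}

def example_location_pdf(shape, sigma_fraction=0.25):
    """Stand-in density centered on the image (assumed, not trained)."""
    height, width = shape
    yy, xx = np.indices(shape)
    pdf = np.exp(-0.5 * (((yy - (height - 1) / 2) / (sigma_fraction * height)) ** 2 +
                         ((xx - (width - 1) / 2) / (sigma_fraction * width)) ** 2))
    return pdf / pdf.sum()

def centrality(mask, pdf):
    """Integral of the density over the region, normalized by region area."""
    return float(pdf[mask].sum() / np.count_nonzero(mask))
```

Normalizing by the region area keeps large regions from scoring high merely because they cover more of the density.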
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16 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using an extended neighborhood window to compute a plurality of 
the relative saliency features, wherein the extended neighborhood window is determined by the steps of: 

(c1) finding a minimum bounding rectangle of a region;

(c2) stretching the minimum bounding rectangle in all four directions proportionally; and, 

(c3) defining all regions intersecting the stretched minimum bounding rectangle as neighbors of the region. 

17 The method wherein the step of extracting for each of the regions at least one structural saliency feature and
at least one semantic saliency feature includes using a centrality as the location feature, wherein the centrality
feature is computed by the steps of:

(c1) determining a probability density function of main subject locations using a collection of training data;

(c2) computing an integral of the probability density function over an area of a region; and, 

(c3) obtaining a value of the centrality feature by normalizing the integral by the area of the region. 

18 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using a hyperconvexity as the convexity feature, wherein the hyper- 
convexity feature is computed as a ratio of a perimeter-based convexity measure and an area-based convexity 
measure. 

19 The method wherein the step of extracting for each of the regions at least one structural saliency feature and
at least one semantic saliency feature includes computing a maximum fraction of a region perimeter shared with a
neighboring region as the surroundedness feature.

20 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using an orientation-unaware borderness feature as the borderness 
feature wherein the orientation-unaware borderness feature is categorized by the number and configuration of 
image borders a region is in contact with, and all image borders are treated equally. 

21 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using an orientation-aware borderness feature as the borderness 
feature, wherein the orientation-aware borderness feature is categorized by the number and configuration of image 
borders a region is in contact with, and each image border is treated differently. 

22 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using the borderness feature that is determined by what fraction of 
an image border is in contact with a region. 

23 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes using the borderness feature that is determined by what fraction of 
a region border is in contact with an image border. 

24 The method wherein the step of integrating the structural saliency feature and the semantic feature using a
probabilistic reasoning engine into an estimate of a belief that each region is the main subject includes using a
Bayes net as the reasoning engine.

25 The method wherein the step of integrating the structural saliency feature and the semantic feature using a
probabilistic reasoning engine into an estimate of a belief that each region is the main subject includes using a
conditional probability matrix that is determined by using fractional frequency counting according to a collection of
training data.



26 The method wherein the step of integrating the structural saliency feature and the semantic feature using a 
probabilistic reasoning engine into an estimate of a belief that each region is the main subject includes using a 
belief sensor function to convert a measurement of a feature into evidence, which is an input to a Bayes net. 

27 The method wherein the step of extracting for each of the regions at least one structural saliency feature and 
at least one semantic saliency feature includes outputting a belief map, which indicates a location of and a belief in
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Claims 

1. A method for detecting a main subject in an image, the method comprising the steps of:

a) receiving a digital image;

b) extracting regions of arbitrary shape and size defined by actual objects from the digital image;

3 - tSi~ - •» — — » fca,,,r9 ' 

srsr is -jsset ~~ — - — - — — - 

tion as the semantic saliency feature. 
. The method as in claim 1, wherein step (d) includes using a collection of human opinions to train the reasoning engine.

The method as in claim 1, wherein step (c) includes using either individually or in combination a self-saliency fea-

7. The method as in claim 1, wherein step (d) includes using a Bayes net as the reasoning engine.

8. The method as in claim 1, wherein step (d) includes using a conditional probability matrix that is determined by
using fractional frequency counting according to a collection of training data.

9. A method for detecting main subjects

extracting for each of the regions at least one structural saliency feature and at least one semantic saliency
feature; and

integrating the structural saliency feature and the semantic feature using a probabilistic reasoning engine
into an estimate of a belief that each region is the main subject.

10. The method as in claim 9, wherein step (c) includes using either individually or in combination non-purposive
grouping and purposive grouping.
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